Epsilon change in normalise for stability #2421

Merged
merged 3 commits into FluxML:master on Nov 5, 2024

Conversation

@billera (Contributor) commented Apr 7, 2024

Normalise allows for an optional epsilon term aimed at improving numerical stability. Previously the epsilon was added after computing the standard deviation of the input. The standard deviation computation involves a square root, which leads to NaN gradients through normalise when the variance is very low; for instance, a LayerNorm applied to low-variance inputs produces NaN gradients. By first computing the variance and taking the square root only after adding epsilon^2 (squared to preserve the scale of the denominator), we prevent NaNs in the gradients at low variance. See the following example with LayerNorm on the current implementation (before this change).

using Flux
using Zygote

ln = LayerNorm(256; eps = 1f-3)
for i in 1:10
    # the perturbation scale shrinks by a factor of 10 each iteration,
    # so the input variance approaches zero
    x = ones(Float32, 256) .+ randn(Float32, 256) .* 10f0^(-i)
    l, gs = Zygote.withjacobian(ln, x)
    @show maximum(gs[1])
end


>>> maximum(gs[1]) = 9.44178f0
>>> maximum(gs[1]) = 95.85736f0
>>> maximum(gs[1]) = 477.4946f0
>>> maximum(gs[1]) = 910.05457f0
>>> maximum(gs[1]) = 985.8402f0
>>> maximum(gs[1]) = 995.0282f0
>>> maximum(gs[1]) = 995.9835f0
>>> maximum(gs[1]) = NaN32
>>> maximum(gs[1]) = NaN32
>>> maximum(gs[1]) = NaN32

We observe that while the epsilon added in the denominator caps the gradients at low variance, it does not prevent NaNs, because the square root inside the std computation is not padded. When using the updated normalise, these NaNs disappear,

>>> maximum(gs[1]) = 9.531697f0
>>> maximum(gs[1]) = 105.468056f0
>>> maximum(gs[1]) = 674.7051f0
>>> maximum(gs[1]) = 991.67163f0
>>> maximum(gs[1]) = 996.03973f0
>>> maximum(gs[1]) = 996.09314f0
>>> maximum(gs[1]) = 996.0937f0
>>> maximum(gs[1]) = 996.0937f0
>>> maximum(gs[1]) = 996.0937f0
>>> maximum(gs[1]) = 996.0937f0

and the gradients remain fixed at the implicitly capped value. A simple test verifying this computation's equivalence with the previous one (modulo the differences at very low standard deviations) could be added if desired.
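
For reference, the change described above amounts to roughly the following. This is only a sketch with made-up helper names (normalise_old, normalise_new), not the exact diff in this PR:

using Statistics

# Old behaviour: eps is added after the square root. sqrt has an unbounded
# derivative at zero, so gradients become NaN when the variance is ~0.
function normalise_old(x; dims=ndims(x), eps=1f-5)
    mu = mean(x; dims)
    sigma = std(x; dims, mean=mu, corrected=false)
    return (x .- mu) ./ (sigma .+ eps)
end

# New behaviour: eps^2 is added to the variance before the square root, so
# the denominator and its gradient stay finite even at zero variance.
function normalise_new(x; dims=ndims(x), eps=1f-5)
    mu = mean(x; dims)
    sigma2 = var(x; dims, mean=mu, corrected=false)
    return (x .- mu) ./ sqrt.(sigma2 .+ eps^2)
end

Squaring eps keeps the denominator on the same scale as before: as the variance goes to zero, sqrt(var + eps^2) tends to eps, just as std + eps did.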

@billera billera closed this Apr 7, 2024
@billera billera reopened this Apr 7, 2024

codecov bot commented Apr 7, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 60.45%. Comparing base (e1989b5) to head (fbc1186).
Report is 2 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff             @@
##           master    #2421       +/-   ##
===========================================
+ Coverage   33.50%   60.45%   +26.94%     
===========================================
  Files          31       31               
  Lines        1910     1942       +32     
===========================================
+ Hits          640     1174      +534     
+ Misses       1270      768      -502     


@CarloLucibello (Member) commented:

I agree with this change, and PyTorch does the same thing. It should be considered a breaking change though, so let's wait until we are near v0.15 before merging.

@CarloLucibello CarloLucibello added this to the v0.15 milestone Apr 7, 2024
@mcabbott (Member) commented Apr 9, 2024

Can this have a test with an input which triggers the NaN behaviour before the change?

Ideally testing not just the function, but also LayerNorm, maybe BatchNorm, anything which uses this internally. Then if the implementation of these layers finally gets replaced, it will be harder to lose the change.
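
For illustration, such a test could look roughly like the following (a sketch only; the tests actually added in the merged commits may differ):

using Flux, Zygote, Test

# A constant input has zero variance, which previously produced NaN gradients.
x = ones(Float32, 256)

# Test the function directly...
g = Zygote.gradient(x -> sum(Flux.normalise(x)), x)[1]
@test !any(isnan, g)

# ...and through a layer that uses it internally.
ln = LayerNorm(256; eps = 1f-3)
J = Zygote.jacobian(ln, x)[1]
@test !any(isnan, J)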

@ToucheSir (Member) commented:

Putting a backlink to #2096 because this work should close that.

@CarloLucibello CarloLucibello merged commit 91f2d47 into FluxML:master Nov 5, 2024
5 of 9 checks passed
@mcabbott (Member) left a comment

Needs news entry?

-@inline function normalise(x::AbstractArray; dims=ndims(x), eps=ofeltype(x, 1e-5))
+@inline function normalise(x::AbstractArray; dims=ndims(x), eps=1f-5)

Why does this now assume Float32? Elsewhere we try to allow for Float16 too.
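
For illustration, a Float32 eps literal promotes lower-precision inputs, whereas converting eps to the input's element type (which is what ofeltype is assumed to do) preserves it:

using Statistics

x = randn(Float16, 8)

eps32 = 1f-5                              # Float32 literal, as in the new signature
eps16 = convert(float(eltype(x)), 1e-5)   # assumed ofeltype-style conversion

@show eltype((x .- mean(x)) ./ sqrt.(var(x) .+ eps32^2))   # Float32: result is promoted
@show eltype((x .- mean(x)) ./ sqrt.(var(x) .+ eps16^2))   # Float16: element type preserved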
